ggml-opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill#22755
happyyzy wants to merge 2 commits into ggml-org:master
|
Hi @happyyzy, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
|
Thanks for the note. I closed the other open PR (#22117) and will focus on this smaller xmem GEMM PR first. |
|
Thank you - this is much easier. Will take a closer look in the next few days. |
|
I was able to reproduce your results on A840 with Qwen3-1.7B-f16. My device has
With
build: f79069d9d (9050)
Without
build: f79069d9d (9050) |
cl_program program_mul_mv_f32_f32;
cl_program program_mul;
cl_program program_mul_mat_f16_f32_tiled;
cl_program program_adreno_xmem_gemm_f16_f32;
Let's remove this cl_program object. Instead, use a local cl_program in load_cl_kernels and release it when done. Something like this,
cl_program prog =
build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), CL_moe_compile_opts);
CL_CHECK((backend_ctx->kernel_gemv_moe_q4_0_f32_ns = clCreateKernel(prog, "kernel_gemv_moe_q4_0_f32_ns", &err), err));
CL_CHECK(clReleaseProgram(prog));
GGML_LOG_CONT(".");
There is no need to keep the cl_program objects and we plan to remove them.
#ifdef GGML_OPENCL_USE_ADRENO_KERNELS
backend_ctx->adreno_xmem_gemm_enabled = getenv("GGML_OPENCL_ADRENO_XMEM_GEMM") != nullptr &&
                                        backend_ctx->gpu_family == GPU_FAMILY::ADRENO &&
                                        backend_ctx->adreno_gen == ADRENO_GPU_GEN::A8X;
backend_ctx->gpu_family == GPU_FAMILY::ADRENO is enough and you don't need to check for A8x.
I think you can safely assume the two extensions for xmem always exist on modern Adreno GPUs (I think they even go back to A6x). In case they are not supported, the kernel compilation will fail.
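For illustration, the simplified gate suggested above (environment variable set, GPU family is Adreno, no generation check) could be sketched as follows. This is a minimal sketch, not the PR's code: `GpuFamily`, `BackendCtx`, `xmem_gemm_should_enable`, and `init_xmem_gemm_flag` are hypothetical stand-ins for the backend's own types and init code. Only the environment variable name comes from the diff.

```cpp
#include <cstdlib>

// Hypothetical stand-ins for the backend's GPU family enum and context struct.
enum class GpuFamily { Adreno, Intel, Unknown };

struct BackendCtx {
    GpuFamily gpu_family               = GpuFamily::Unknown;
    bool      adreno_xmem_gemm_enabled = false;
};

// Pure decision: the opt-in was requested and the device is an Adreno GPU.
// No generation check; an unsupported device is expected to fail at kernel
// compilation instead.
inline bool xmem_gemm_should_enable(bool env_set, GpuFamily family) {
    return env_set && family == GpuFamily::Adreno;
}

// Reads the opt-in from the environment. As in the diff, the variable only
// needs to be present; its value is not inspected.
inline void init_xmem_gemm_flag(BackendCtx & ctx) {
    const bool env_set = std::getenv("GGML_OPENCL_ADRENO_XMEM_GEMM") != nullptr;
    ctx.adreno_xmem_gemm_enabled = xmem_gemm_should_enable(env_set, ctx.gpu_family);
}
```

Because the check is only for presence, setting the variable to `0` would still enable the path under this sketch.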
|
Thanks, addressed both comments: the xmem program is now local to |
Summary
This PR adds an opt-in Adreno xmem GEMM path for OpenCL prefill matmul.
Scope:
- GGML_OPENCL_USE_ADRENO_KERNELS
- GGML_OPENCL_ADRENO_XMEM_GEMM=1
- F16 x F32 -> F32
- GGML_OP_MUL_MAT
- N > 1, so token-generation / GEMV decode is not routed through this path

The implementation keeps the existing ggml tensor layout externally and uses a small bridge around the xmem GEMM:
The generic OpenCL matmul path remains unchanged unless the new runtime opt-in is set.
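The scope conditions can be read as a single routing predicate. The sketch below is illustrative only and is not taken from the PR: `TensorType`, `Op`, `MatmulDesc`, and `use_xmem_gemm` are hypothetical names, and `n` is assumed to be the number of activation columns (tokens in the batch), so `n == 1` corresponds to decode.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-ins for ggml's type and op enums.
enum class TensorType { F16, F32, Other };
enum class Op { MulMat, Other };

// Hypothetical description of one matmul node as seen by the dispatcher.
struct MatmulDesc {
    Op         op;
    TensorType src0;          // weights
    TensorType src1;          // activations
    TensorType dst;           // output
    int64_t    n;             // assumed: number of activation columns (tokens)
    bool       xmem_enabled;  // runtime opt-in already resolved
};

// True only when every scope condition holds; otherwise the caller is
// expected to fall back to the existing OpenCL matmul path.
inline bool use_xmem_gemm(const MatmulDesc & d) {
    return d.xmem_enabled &&
           d.op   == Op::MulMat &&
           d.src0 == TensorType::F16 &&
           d.src1 == TensorType::F32 &&
           d.dst  == TensorType::F32 &&
           d.n > 1;  // n == 1 is GEMV decode and stays on the generic path
}
```

Keeping the check as one pure function makes the fallback behavior easy to verify: any failed condition leaves the generic path untouched.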
Results
Tested on Adreno 830 with OpenCL:
Qwen2.5 1.5B F16
Before, baseline OpenCL:
After, with GGML_OPENCL_ADRENO_XMEM_GEMM=1:
Prefill improved from 204.98 tok/s to 356.19 tok/s, about 1.74x.
Qwen2.5 3B F16
Before, baseline OpenCL:
After, with GGML_OPENCL_ADRENO_XMEM_GEMM=1:
Prefill improved from 101.26 tok/s to 163.90 tok/s, about 1.62x.
Decode is intentionally unchanged. Decode-only profiling confirmed that token generation stays on the existing OpenCL path (adreno_xmem count = 0).
Correctness
Checked end-to-end generation with the xmem path enabled on Qwen2.5 1.5B F16 and Qwen2.5 3B F16. Both models produced normal decode output.
Notes
This path depends on Qualcomm Adreno OpenCL subgroup constant-load extensions and is therefore guarded behind the existing Adreno kernel build option plus an explicit runtime environment variable.
Requirements